ABSTRACT

Most existing automatic content analysis and indexing
techniques are based on word frequency characteristics applied
largely in an ad hoc manner. Contradictory requirements arise in this
connection, in that terms exhibiting high occurrence frequencies in
individual documents are often useful for high recall performance (to
retrieve many relevant items), whereas terms with low frequency in
the whole collection are useful for high precision (to reject
nonrelevant items). A new technique known as discrimination value
analysis ranks the text words in accordance with how well they are
able to discriminate the documents of a collection from each other;
that is, the value of a term depends on how much the average
separation between individual documents changes when the given term
is assigned for content identification. The best words are those
which achieve the greatest separation. The discrimination value
analysis accounts for a number of important phenomena in the content
analysis of natural language texts: (a) the role and importance of
single words; (b) the role of juxtaposed words (phrases); (c) the
role of word groups or classes, as specified in a thesaurus.
Effective criteria can be given for assigning each term to one of
these three classes, and for constructing optimal indexing
vocabularies. (Author)
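
The ranking described in the abstract can be sketched as follows. This is a minimal illustration, not the report's own procedure: it assumes that "average separation" is measured as the mean pairwise cosine similarity between document vectors, and that a term's value is the change in that mean when the term is removed. The function names and the three toy documents are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity(docs):
    """Mean pairwise similarity; a low value means well-separated documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_values(docs):
    """Map each term to (mean similarity without the term) - (mean similarity with it).

    A positive value means that assigning the term spreads the documents apart
    (assumed definition, chosen to match the abstract's description).
    """
    base = average_similarity(docs)
    vocabulary = set().union(*docs)
    values = {}
    for term in vocabulary:
        reduced = [{t: w for t, w in d.items() if t != term} for d in docs]
        values[term] = average_similarity(reduced) - base
    return values


# Toy collection (hypothetical): two terms occur everywhere, three occur once.
docs = [
    {"retrieval": 1, "term": 1, "index": 1},
    {"retrieval": 1, "term": 1, "thesaurus": 1},
    {"retrieval": 1, "term": 1, "phrase": 1},
]
values = discrimination_values(docs)
ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
for term, value in ranked:
    print(f"{term:10s} {value:+.4f}")
```

In this toy collection the terms that occur in every document receive negative values, while the terms that occur in a single document receive positive ones, which is the behaviour the abstract attributes to poor and good discriminators respectively.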




[Fig. 5(a): Number of Good Discriminators Among Low Frequency Terms.
Medlars collection, 450 documents, 4726 terms. Vertical axis: number of
top 100 discriminators in range (0 to 50). Horizontal axis: percentage
of low frequency terms (50% to 90%). Plotted points are labeled by
document frequency (DF) range and term count, starting at DF = 1
(2599 terms).]



[Fig. 5(b): Number of Good Discriminators Among Low Frequency Terms.
Vertical axis: number of top 100 discriminators in range. Horizontal
axis: percentage of low frequency terms (up to 90%). Plotted points are
labeled by document frequency (DF) range and term count.]



[Fig. 6(a): Number of Good Discriminators Among High Frequency Terms.
Medlars collection. Vertical axis: number of top 100 discriminators in
range (10 to 25). Horizontal axis: percentage of high frequency terms
and percentage of term assignments. Plotted points are labeled by
document frequency (DF) range and term count.]



[Fig. 6(b): Number of Good Discriminators Among High Frequency Terms.
Time collection. Vertical axis: number of top discriminators in range.
Horizontal axis: percentage of high frequency terms and percentage of
term assignments. Plotted points are labeled by document frequency (DF)
range and term count, for example DF 25-271 (821 terms).]



[Fig. 7: Summarization of Discrimination Value of Terms in Frequency
Ranges. Horizontal axis: document frequency (n documents in all),
marked at n/100, n/10, and n/2. Regions in order of increasing document
frequency: poor discriminators, best discriminators, worst
discriminators (50% of term assignments). Each region is annotated with
its percentage of terms. Arrows: left to right, recall improving; right
to left, precision improving.]
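
The frequency ranges of Fig. 7 suggest a simple classification rule, sketched below. This is an illustration under stated assumptions rather than the report's own criteria: the cut points n/100 and n/10 are taken from the axis marks of the figure, the handling of the boundary values is arbitrary, and the mapping of each range to one of the three term classes named in the abstract is an assumed reading of the figure's two arrows. The names `frequency_class` and `TREATMENT` are hypothetical.

```python
def frequency_class(document_frequency, n_documents):
    """Place a term in a frequency range using the n/100 and n/10 marks of Fig. 7."""
    if not 0 < document_frequency <= n_documents:
        raise ValueError("document frequency must lie in 1..n_documents")
    if document_frequency < n_documents / 100:
        return "low"      # poor discriminators in the figure
    if document_frequency <= n_documents / 10:
        return "medium"   # best discriminators in the figure
    return "high"         # worst discriminators in the figure


# Assumed treatment per range (illustrative labels, not the report's wording).
TREATMENT = {
    "low": "group into a thesaurus class (recall improving)",
    "medium": "use as a single-word index term",
    "high": "combine into a phrase (precision improving)",
}

n = 450  # example collection size, assumed for illustration
for df in (3, 20, 100):
    label = frequency_class(df, n)
    print(df, label, TREATMENT[label])
```

With n = 450 the cut points fall at 4.5 and 45 documents, so the three sample terms land in the low, medium, and high ranges in turn.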




















(? marks a character that is illegible in the scan.
 Std. = standard term frequency run; Phrase = phrase assignment run;
 Adv. = advantage of the phrase run over the standard run.)

          CRAN 424                MED 450                 TIME 425
Recall    Std.    Phrase  Adv.    Std.    Phrase  Adv.    Std.    Phrase  Adv.

 .1       .68?4   .?793   ?       .7???   .8911   +12%    .74?6   .84??   +13%
 .2       .5303   .734?   +3?%    .6750   .8149   +21%    .7071   .8?19   +1?%
 .3       .46?0   .6013   ?       .5?81   .6902   +28%    .6710   .7998   +19%
 .4       .3482   .?205   +?9%    .4807   .6481   +35%    .6452   .7729   +20%
 .5       .3134   .4150   +32%    .?38?   .5?30   +35%    .6351   .7025   +11%
 .6       .2556   .3623   +4?%    .3721   .5450   +46%    .5866   .6800   +16%
 .7       .1980   .3017   +52%    .3357   .4867   +45%    .5413   .6331   +17%
 .8       .1?31   .1?53   +20%    .2105   .3263   +?9%    .5004   .5?05   +18%
 .9       .1265   .1463   ?       .1768   .2767   +56%    .3865   .4618   +19%
1.0       .1176   .1314   ?       .??10   .1?69   +60%    .3721   .4529   +22%

Average                   +32%                    +39%                    +17%

Average Precision Values at Ten Recall Points
(Phrase Process vs. Standard)
Table 3
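
The advantage figures in Table 3 appear to be relative gains of the phrase run over the standard run. The sketch below checks that reading against three rows of the third collection in the table that are legible in the scan. The formula is inferred from those rows, not stated in the source, and the function name `advantage` is hypothetical.

```python
def advantage(standard, improved):
    """Relative gain of `improved` over `standard`, in percent (assumed formula)."""
    return 100.0 * (improved - standard) / standard


# (standard precision, phrase precision, printed advantage in percent)
# Rows at recall .3, .5, and .6 of the third collection in Table 3.
legible_rows = [
    (0.6710, 0.7998, 19),
    (0.6351, 0.7025, 11),
    (0.5866, 0.6800, 16),
]

for standard, phrase, printed in legible_rows:
    computed = advantage(standard, phrase)
    print(f"{standard:.4f} -> {phrase:.4f}: {computed:.1f}% (printed +{printed}%)")
```

Each computed value rounds to the printed percentage, which supports the relative-gain reading for the rows checked here.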







(? marks a character that is illegible in the scan.
 Std. = standard term frequency run; Thes. = thesaurus plus phrases run;
 Adv. = advantage of the thesaurus plus phrases run over the standard run.)

          CRAN 424                MED 450                 TIME 425
Recall    Std.    Thes.   Adv.    Std.    Thes.   Adv.    Std.    Thes.   Adv.

 .1       .68?4   .????   ?       .7???   .8010   1?.0%   .74?6   .8339   11.2%
 .2       .5?0?   .7038   ?       .6750   .8?31   21.?%   .7071   .8138   15.0%
 .3       .????   .????   ?       .5?81   .7057   ?       .6710   .7812   16.4%
 .4       .34?2   .????   ?       .4807   .6443   34.0%   .6452   .7?8?   19.0%
 .5       .3134   .????   ?       .????   .6090   ??.1%   .6351   .7006   10.3%
 .6       .2556   .3?18   ?       .3721   .5518   4?.?%   .5866   .6902   17.3%
 .7       .1980   .????   ?       .3357   .5179   54.3%   .5413   .6389   18.0%
 .8       .1?31   .?019   ?       .2105   .3??0   70.0%   .5004   .5?15   18.2%
 .9       .1265   .1??6   ?       .1768   .3505   98.2%   .3865   .4842   25.3%
1.0       .1176   .1??5   ?       .??10   .????   101.9%  .3721   .4790   28.7%

Average                   ?                       +52%                    +18%
Average (Phrases)         +32%                    +39%                    +17%
Difference                ?                       +13%                    +1%

Average Precision Values at Ten Recall Points
(Thesaurus and Phrases vs. Standard)
Table 4






